This project analyzes the TMDb movie dataset, which contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. In particular, we are interested in which genres are most popular from year to year, and in what kinds of properties are associated with movies that have high revenues.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Reading data
df = pd.read_csv('tmdb-movies.csv')
# Reading the first 2 rows of data
df.head(2)
# Let's check the size of this dataset
df.shape
# Since not all columns are shown in the Notebook, let's check all the columns' names
df.columns
This dataset contains 21 columns, with the following details:
It's worth noting that the popularity column, according to the TMDb website, takes into account the following factors for movies:
# Let's print a summary of our dataframe
df.info()
We see that there are missing values in the following columns:
We also see that there are columns with incorrect data types, or with a data type different than what we require:
# Let's look at the unique values of each column
df.nunique()
# Let's see how many duplicated rows (only rows with identical values in every column) are in our dataset
df.duplicated().sum()
This sheds some light on possible duplicates in columns where values should be unique:
# Let's see the first 2 rows of "keywords" column
df.keywords.head(2)
Also, it's worth noting that certain columns ("cast", "genres", "keywords", "director" and "production_companies") contain multiple values separated by the pipe (|) character, which we will need to preprocess before doing our analysis.
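That preprocessing can be sketched with `str.split` plus `explode` (available in pandas 0.25+). A minimal example on a synthetic two-row frame, with column names mirroring the dataset:

```python
import pandas as pd

# Synthetic rows standing in for the TMDb data: one movie with three
# pipe-separated genres, one with a single genre.
movies = pd.DataFrame({
    'id': [1, 2],
    'genres': ['Action|Adventure|Science Fiction', 'Drama'],
})

# Split each string into a list, then expand to one row per genre.
exploded = movies.assign(genres=movies['genres'].str.split('|')).explode('genres')
print(exploded)
```

Each movie keeps its id, so per-genre aggregations remain possible afterwards.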
The first step in our cleaning process is to remove columns that will not be used in the analysis. This will make our scripts more efficient and give us a cleaner table to analyze. The reasoning for each drop is as follows:
# Dropping columns
df.drop(['budget', 'revenue', 'homepage', 'cast','tagline', 'keywords', 'overview', 'production_companies', 'release_date', 'vote_count'], axis=1, inplace=True)
# Let's check the first row of our updated table
df.head(1)
Next, we will analyze duplicated rows.
# Let's see the only duplicated value from this dataset
df[df.duplicated(keep=False)]
Since this is a genuine duplicate and there is only one, we will drop the second occurrence.
# Dropping duplicated row
df.drop_duplicates(inplace=True)
# Checking the duplicated row is gone
df[df.duplicated(keep=False)]
# Let's see the summary of the updated table
df.info()
# Let's check the unique values of the updated table
df.nunique()
Since the "id" column now has exactly as many unique values as there are rows in the dataset, we no longer need the "imdb_id" column, so we will drop it too.
# Dropping the "imdb_id" column
df.drop(['imdb_id'], axis=1, inplace=True)
# Let's see how our columns look
df.head(2)
# Let's count the missing values in each column
df.isnull().sum()
# Let's look at the proportions of the missing values in each column
(df.isnull().sum() / df.shape[0]) * 100
It looks like we have:
All of these are categorical variables, so we cannot impute them as easily as numerical variables (with the mean or median).
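To illustrate the contrast, here is a small sketch on synthetic columns (the names merely mirror the dataset): a numeric gap can take the median, while a categorical gap can only take something like the mode, or be dropped, which is the route we follow below.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'runtime': [90.0, 120.0, np.nan, 100.0],       # numeric: impute with the median
    'genres': ['Drama', None, 'Action', 'Drama'],  # categorical: mode or drop
})
toy['runtime'] = toy['runtime'].fillna(toy['runtime'].median())
toy['genres'] = toy['genres'].fillna(toy['genres'].mode()[0])
print(toy)
```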
Since we are going to use the "genres" column for our analysis, let's first look deeper into the missing values in that column:
# Let's look at the row values where "genres" column is null
df[df.genres.isnull()].head(10)
# Let's look at the histogram of the whole dataframe
df.hist(figsize=(10,8));
# Let's also look at the histograms of the data where the "genres" column is null
df[df.genres.isnull()].hist(figsize=(10,8));
It looks like where the "genres" column has NULL values, "budget_adj" and "revenue_adj" are also zero (in most cases), when there should be at least some minimum value.
Also, where the "genres" column is empty, other columns like "director" tend to be empty as well.
Considering that, and the fact that rows with missing values in the "genres" column make up only 0.21% of the whole dataset, we will remove those rows.
# Removing rows where "genres" is null
df = df[df['genres'].notna()]
# Let's check that rows where "genres" column is null are gone
df[df.genres.isnull()]
# Let's count once again the missing values in each column
df.isnull().sum()
# Let's also look at the proportions of the missing values in each column
(df.isnull().sum() / df.shape[0]) * 100
It looks like we still have:
Since we are going to use the "director" column for our analysis, let's look deeper into the missing values in that column:
# Let's look at the row values where "director" column is null
df[df.director.isnull()].head(10)
# Let's look at the histograms of the data where the "director" column is null
df[df.director.isnull()].hist(figsize=(10,8));
It looks like where the "director" column has NULL values, "budget_adj" and "revenue_adj" are also zero (in most cases), when there should be at least some minimum value.
Considering that, and the fact that rows with missing values in the "director" column make up only 0.39% of the whole dataset, we will remove those rows.
# Removing rows where "director" is null
df = df[df['director'].notna()]
# Let's check that rows where "director" column is null are gone
df[df.director.isnull()]
# Let's double check there is no missing data left
df.isnull().sum()
We now have our dataset without missing values. But in the process, we noticed movies whose "budget_adj" and "revenue_adj" are zero, where there should be at least some minimum value. Let's see how many movies still show that pattern.
# Finding rows with zero budget or zero revenue
missing_budget_revenue = df[(df['budget_adj'] == 0) | (df['revenue_adj'] == 0)]
# Let's see the shape of this "missing_budget_revenue" table
missing_budget_revenue.shape
# Let's see the top 5 rows with zero budget or zero revenue
missing_budget_revenue.head(5)
# Let's look at the histogram of rows with zero budget or zero revenue
missing_budget_revenue.hist(figsize=(10,8));
# Let's calculate the proportion of rows with zero budget or zero revenue
(missing_budget_revenue.shape[0]) / (df.shape[0]) * 100
# Finding rows without zero budget nor zero revenue
no_missing_budget_revenue = df[(df['budget_adj'] != 0) & (df['revenue_adj'] != 0)]
# Let's see the shape of this "no_missing_budget_revenue" table
no_missing_budget_revenue.shape
# Let's see the top 5 rows without zero budget nor zero revenue
no_missing_budget_revenue.head(5)
# Let's look at the histogram of rows without zero budget nor zero revenue
no_missing_budget_revenue.hist(figsize=(10,8));
It seems like the data with either zero budget or zero revenue behaves similarly to the rest of the data. Note that these rows represent 64% of the dataset (after cleaning missing and duplicated values).
However, since our analysis centers on the movies with the highest revenues, it is important to analyze only the data with nonzero revenue. We also need movies with a nonzero budget, because any movie must have spent at least some amount on production, cast, marketing, etc.
Therefore we will use the "no_missing_budget_revenue" table for our analysis.
# Assigning the "no_missing_budget_revenue" data to our dataset
df = no_missing_budget_revenue
# Let's check top5 rows of our new dataset
df.head(5)
# Let's check the shape of this new dataset
df.shape
Before moving forward, let's fix the data types of the remaining variables.
# Let's check the summary of our updated table
df.info()
We are going to convert "id" column to a string type as we will not perform numeric calculations with it.
# Fixing datatypes
df = df.astype({'id':'str'})
df.dtypes
We now have our dataset completely clean and ready for analysis.
# Let's first quickly review the histogram of all the numerical variables in the dataset
df.hist(figsize=(10,8));
# Let's define the function for Histograms of Numerical Continuous Values
def func_hist(x, var_name, xlabel, ylabel):
    fig, ax = plt.subplots(figsize=(15,8))
    plt.hist(x, bins='auto')
    plt.title('Histogram of {}'.format(var_name))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    ax.set_xlim(0,)
    ax.set_ylim(0,)
    plt.show()
It's important to note that we chose the "auto" method to select the number of histogram bins. This method chooses the maximum number of bins between the "Sturges" and "Freedman-Diaconis" estimators.
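That behavior can be checked directly with NumPy's `histogram_bin_edges` (the same estimators `plt.hist` relies on): with `bins='auto'`, the edge count matches whichever of the two estimators produces more bins. A sketch on a synthetic normal sample:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# Compare the three estimators on the same sample.
edges_auto = np.histogram_bin_edges(data, bins='auto')
edges_sturges = np.histogram_bin_edges(data, bins='sturges')
edges_fd = np.histogram_bin_edges(data, bins='fd')
print(len(edges_sturges) - 1, len(edges_fd) - 1, len(edges_auto) - 1)
```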
# Let's see the histogram in detail for budget_adj
func_hist(df['budget_adj']/1e6,'Budget in terms of 2010 dollars', 'Budget in terms of 2010 dollars (values in millions)', 'Amount of movies')
We can see that the distribution of Budget in terms of 2010 dollars is skewed to the right: most movies in our clean dataset have a budget under 100 million, while a small number of high-budget movies exceed 100 million, some even exceeding 400 million.
# Let's define the function for Boxplot of Numerical Continuous Values
def func_box(X, var_name, xlabel):
    fig, ax = plt.subplots(figsize=(16,6))
    bp = plt.boxplot(X, vert=False, showmeans=True)
    plt.title('Boxplot of {}'.format(var_name))
    plt.xlabel(xlabel)
    plt.yticks([1], ["Boxplot"], rotation=90, verticalalignment="center")
    ax.grid(visible=True, axis='x')
    # Adding mean label
    if abs(X.mean() - X.median()) < 0.1:
        ax.text(X.mean(), 1.02, '%.2f' % (X.mean()), color='green')
    else:
        ax.text(X.mean(), 1.02, '%.1f' % (X.mean()), color='green')
    # Adding median label
    x, y = bp['medians'][0].get_xydata()[1]
    if abs(X.mean() - X.median()) < 0.1:
        ax.text(x, y+0.01, '%.2f' % x, horizontalalignment='center')
    else:
        ax.text(x, y+0.01, '%.1f' % x, horizontalalignment='center')
    # Adding quartile labels
    x, y = bp['boxes'][0].get_xydata()[0]
    ax.text(x, y-0.01, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
    x, y = bp['boxes'][0].get_xydata()[3]
    ax.text(x, y-0.01, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
    # Adding whisker labels
    x, y = bp['caps'][0].get_xydata()[0]
    ax.text(x, y-0.01, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
    x, y = bp['whiskers'][1].get_xydata()[1]
    ax.text(x, y-0.05, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
    # Setting xlim
    if X.min() > 0.01:
        ax.set_xlim(0,)
    plt.show()
# Let's see the boxplot of budget_adj
func_box(df['budget_adj']/1e6, 'Budget in terms of 2010 dollars', 'Budget in terms of 2010 dollars (values in millions)')
If we look at the boxplot of Budget in terms of 2010 dollars, we can see that the majority of movies in our clean dataset (between the 25th and 75th percentiles) had a budget between 13.1 and 60.6 million dollars.
Also, it's worth noting that since we have a large number of outliers with high values, the mean is much greater than the median. We should take this into account when deciding between the median and the mean for our analysis.
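A tiny numeric sketch (made-up budget values in millions, not real data) of why outliers push the mean above the median:

```python
import numpy as np

# Five modest budgets plus two blockbuster-sized outliers.
budgets = np.array([10, 15, 20, 25, 30, 400, 425])
print('mean:', np.mean(budgets))      # pulled up by the two outliers
print('median:', np.median(budgets))  # stays with the typical movie
```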
# Let's see the histogram in detail for revenue_adj
func_hist(df['revenue_adj']/1e6,'Revenue in terms of 2010 dollars', 'Revenue in terms of 2010 dollars (values in millions)', 'Amount of movies')
We can see that the distribution of Revenue in terms of 2010 dollars is also skewed to the right: most movies in our clean dataset have a revenue under 500 million, while a small number of high-revenue movies exceed 500 million, some even exceeding 2,500 million.
# Let's see the boxplot of revenue_adj
func_box(df['revenue_adj']/1e6, 'Revenue in terms of 2010 dollars', 'Revenue in terms of 2010 dollars (values in millions)')
If we look at the boxplot of Revenue in terms of 2010 dollars, we can see that the majority of movies in our clean dataset (between the 25th and 75th percentiles) had a revenue between 18.4 and 163.3 million dollars.
Also, it's worth noting that since we have a large number of outliers with high values, the mean is much greater than the median. We should take this into account when deciding between the median and the mean for our analysis.
# Let's see the histogram in detail for popularity
func_hist(df['popularity'],'Popularity', 'Popularity', 'Amount of movies')
We can see that the distribution of Popularity is also skewed to the right: most movies in our clean dataset have a low popularity (less than 5), while a small number of high-popularity movies score above 5, some even above 30.
# Let's see the boxplot of popularity
func_box(df['popularity'], 'Popularity', 'Popularity')
If we look at the boxplot of Popularity, we can see that the majority of movies in our clean dataset (between the 25th and 75th percentiles) had a popularity between 0.5 and 1.4.
Also, it's worth noting that since we have a large number of outliers with high values, the mean is greater than the median. We should take this into account when deciding between the median and the mean for our analysis.
# Let's see the histogram in detail for Vote Average
func_hist(df['vote_average'],'Vote average', 'Vote average', 'Amount of movies')
Since our histogram looks similar to a normal distribution, we will use a density plot to analyze it.
# Let's define the density function for numerical continuous variables
def func_density(x, var_name, xlabel):
    fig, ax = plt.subplots(figsize=(16,6))
    sns.kdeplot(x, fill=True)
    plt.title('Density Plot of {}'.format(var_name))
    plt.xlabel(xlabel)
    plt.ylabel('Density')
    ax.set_xlim(0,)
    ax.set_ylim(0,)
    plt.show()
It's worth noting that we kept the default kernel (Gaussian) and bandwidth (Scott's method) for the density plot.
# Let's plot the density function for vote_average
func_density(df['vote_average'],'Vote average per movie', 'Vote average per movie')
# Let's see the boxplot of vote_average
func_box(df['vote_average'], 'Vote average per movie', 'Vote average per movie')
From the density plot and boxplot, we can see that the distribution of Vote Average per movie looks similar to a normal distribution, with similar mean and median. The mean is just 0.03 lower than the median due to the outliers on the left (movies with a low vote average).
The boxplot also shows that the majority of movies (between the 25th and 75th percentiles) in our clean dataset have a vote average between 5.7 and 6.7 points.
# Let's check the unique values of "runtime" column
df['runtime'].nunique()
Since runtime has 139 unique values, even though it is a discrete variable, a histogram is better suited than a bar chart for analyzing its distribution.
# Let's see the histogram in detail for Runtime
func_hist(df['runtime'],'Running time', 'Running time (in minutes)', 'Amount of movies')
We can see that the distribution of Running Time is also skewed to the right.
# Let's see the boxplot of runtime
func_box(df['runtime'], 'Running time', 'Running time (in minutes)')
If we look at the boxplot of Runtime, we can see that the majority of movies in our clean dataset (between the 25th and 75th percentiles) had a running time between 95 and 119 minutes.
Also, it's worth noting that since we have more outliers on the right, the mean is greater than the median. We should take this into account when deciding between the median and the mean for our analysis.
# Let's check the amount of years analyzed
df.release_year.nunique()
# Let's define the function for bar chart
def func_bar(X, title, xlabel, ylabel):
    fig, ax = plt.subplots(figsize=(16,6))
    X.plot.bar(color='lightblue')
    plt.title(title)
    ax.set_xlabel(xlabel, labelpad=15)
    ax.set_ylabel(ylabel, labelpad=15)
    plt.show()
# Let's plot the bar chart
ej1 = df.groupby('release_year').size()
func_bar(ej1, 'Bar chart of Amount of Movies released per year', 'Year of Release', 'Amount of Movies')
# Let's see how many more movies were released in 2015 compared to 1960
diff = (ej1[2015] - ej1[1960]) / ej1[1960]
diff
From the bar chart above we can see that, in our clean dataset, the amount of movies released per year has increased through the years. For example, the amount of movies released in 2015 is 31 times higher than the amount released in 1960.
# Sorting the values by highest revenues first
df.sort_values(by=['revenue_adj'], ascending=False, inplace=True)
# Selecting top 10 movies with highest revenues
top_10 = df[[ 'original_title','revenue_adj']].head(10).reset_index(drop=True)
top_10
# Let's check the average revenue_adj in the movies of our clean dataset
aver_rev = df['revenue_adj'].mean()
aver_rev
# Let's define a new function for Bar Chart with mean value displayed, ticks sorted and labels in each bar
def func_bar1(X, Y, title, xlabel, ylabel, text_x, text_y, word_length, mean, meanlabel):
    fig, ax = plt.subplots(figsize=(16,6))
    plt.bar(np.arange(len(X)), Y, color='lightblue')
    if len(X) > 10:
        plt.xticks(np.arange(len(X)), X.str[:word_length]+'\n'+X.str[word_length:], rotation=90)
    else:
        plt.xticks(np.arange(len(X)), X.str[:word_length]+'\n'+X.str[word_length:], rotation=0)
    plt.title(title)
    ax.set_xlabel(xlabel, labelpad=15)
    ax.set_ylabel(ylabel, labelpad=15)
    # Adding a value label on each bar
    for i, v in enumerate(Y):
        if Y.dtype == "int64":
            ax.text(i - text_x, v + text_y, str(v), color='blue')
        else:
            ax.text(i - text_x, v + text_y, '{:,.2f}'.format(v), color='blue')
    # Adding the mean line and its label
    if mean is not None and meanlabel is not None:
        plt.hlines(mean, xmin=-0.4, xmax=len(X)-0.6, linestyles='dashed', label=meanlabel)
        ax.text(len(X) - 0.6 - text_x, mean + text_y*2, '{:,.2f}'.format(mean), color='black')
        plt.legend()
    plt.show()
# Let's plot our Top 10 of movies based of revenues
func_bar1(top_10['original_title'], top_10['revenue_adj']/1e6, 'Top 10 movies with highest revenues ', 'Movies\' original title', 'Revenue in terms of 2010 dollars (values in millions)', 0.25, 10, 12, aver_rev/1e6, 'General Movie Mean')
# Let's check how higher is 'Avatar' movie's revenue compared to the general mean
avatar_rev = top_10[top_10['original_title']=='Avatar']['revenue_adj']
diff_revenue = (avatar_rev - aver_rev)/aver_rev
diff_revenue
In our clean dataset, the movie with the highest revenue is Avatar, with a revenue of 2.8 billion dollars (in terms of 2010 dollars).
This movie's revenue is almost 20 times higher than the general average.
# Let's select the data to plot
ej2 = (df.groupby('release_year')['revenue_adj'].sum())/1e9
# Let's plot the bar chart
func_bar(ej2, 'Bar chart of Annual Revenues per year', 'Year of Release', 'Annual Revenues in terms of 2010 dollar (values in billions)')
# Let's see how much more annual revenue the movies received in 2015 compared to 1960
diff = (ej2[2015] - ej2[1960]) / ej2[1960]
diff
From the bar chart above we can see that, in our clean dataset, the amount of annual revenue (in terms of 2010 dollars) of movies has increased through the years. For example, the annual revenue of 2015 is 24 times higher than the annual revenue of 1960.
We will first separate the movies with the highest revenues.
In order to do that, for our minimum cut we will use the median instead of the mean, as there are too many outliers with high values.
# Let's create a table for movies with higher revenue
high_rev = df[df['revenue_adj'] >= df['revenue_adj'].median()]
# Let's check the filter worked
(high_rev['revenue_adj']/1e6).min()
# Let's define a scatter matrix function
def func_scatter_matrix(X, title):
    axes = pd.plotting.scatter_matrix(X, alpha=0.9, figsize=(16, 16), diagonal='kde', color='lightblue')
    plt.suptitle(title, y=0.9)
    # Adding the correlation coefficient to each upper-triangle panel
    corr = X.corr(numeric_only=True).values
    for i, v in zip(*np.triu_indices_from(axes, k=1)):
        axes[i, v].annotate("r: %.4f" % corr[i, v], (0.8, 0.8), xycoords='axes fraction', ha='center', va='center', color='black')
    plt.show()
# Let's create a scatter matrix to analyze the numerical variables of these higher revenues
func_scatter_matrix(high_rev, 'Scatter Matrix of Higher Revenue Movies (with outliers)')
In this scatter matrix, we can spot a weak positive correlation between the following variables:
However, we don't see any strong linear correlation between any numerical values and higher revenues (in terms of 2010 dollars).
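The same pairwise r values can also be pulled out numerically rather than read off the plot. A sketch on synthetic columns (the names are illustrative, not the real data): `DataFrame.corr()` returns the full Pearson matrix, and sorting one of its columns ranks the associations with revenue.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'budget_adj': rng.normal(50, 10, 500),
    'popularity': rng.normal(1, 0.3, 500),
})
# Revenue built to correlate strongly with budget and not with popularity.
toy['revenue_adj'] = 2 * toy['budget_adj'] + rng.normal(0, 5, 500)

corr = toy.corr()
print(corr['revenue_adj'].sort_values(ascending=False))
```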
We then proceed to analyze the data without the outliers to check if it affects the results.
# Let's calculate the IQR (on the numeric columns only)
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)
# Let's select only the numerical columns
num_col = ['popularity', 'runtime', 'vote_average', 'release_year', 'budget_adj', 'revenue_adj']
# Let's create a table for movies without the outliers
df_wo_out = df[~((df[num_col] < (Q1 - 1.5 * IQR)) | (df[num_col] > (Q3 + 1.5 * IQR))).any(axis=1)]
# Let's check the table was created successfully
df_wo_out.shape
Note that around 700 rows considered outliers were dropped.
# Let's select only the highest revenue movies
high_rev1 = df_wo_out[df_wo_out['revenue_adj'] >= df['revenue_adj'].median()]
# Let's check the filter worked
(high_rev1['revenue_adj']/1e6).min()
Note that we still used the median from the "df" table as our minimum cut, in order to make both scatter matrices comparable.
# Let's create a scatter matrix to analyze the numerical variables of these higher revenues (without outliers)
func_scatter_matrix(high_rev1, 'Scatter Matrix of Higher Revenue Movies (without outliers)')
In this new scatter matrix, we can spot some weak linear correlations between the following variables:
However, even after removing the outliers from our table, there is no strong linear correlation between any of the numerical variables and higher revenues (in terms of 2010 dollars).
Answering the questions:
2.1. Is budget associated with higher revenues?
There is no strong linear correlation between budget and higher revenues.
2.2. Is popularity associated with higher revenues?
There is no strong linear correlation between popularity and higher revenues.
2.3. Is vote_average associated with higher revenues?
There is no strong linear correlation between vote_average and higher revenues.
2.4. Is runtime associated with higher revenues?
There is no strong linear correlation between runtime and higher revenues.
# Let's select the data to plot
ej3 = df.groupby('release_year')['runtime'].mean()
ej3.head(5)
# Let's plot the bar chart
func_bar(ej3, 'Bar chart of Average Movie Runtime per year', 'Year of Release', 'Average movie running time (in minutes)')
Given the bar chart above, it looks like the average movie runtime (in our clean dataset) between 1960 and 1965 was higher than in the rest of the years.
But, let's analyze if that changes when we remove the outliers.
# Let's select the data (without outliers) to plot
ej4 = df_wo_out.groupby('release_year')['runtime'].mean()
ej4.head(5)
# Let's plot the new bar chart
func_bar(ej4, 'Bar chart of Average Movie Runtime per year (without outliers)', 'Year of Release', 'Average movie running time (in minutes)')
It looks like the high averages between 1960 and 1965 were in fact driven by outliers.
Overall, however, the average movie runtime has not changed much through the years in our clean dataset.
# Let's create a scatter matrix to analyze if there is any variable associated with popularity and vote average (with outliers)
func_scatter_matrix(df, 'Scatter Matrix of Movies (with outliers)')
# Let's create a scatter matrix without outliers, to see if there is any change
func_scatter_matrix(df_wo_out, 'Scatter Matrix (without outliers)')
After analyzing the scatter matrices of the clean dataset, with and without outliers, it looks like there is no strong linear correlation between any of the numerical variables and popularity or vote_average.
Answering the questions:
4.1. Is popularity associated with vote_average?
There is no strong linear correlation between popularity and vote_average.
4.2. Is runtime associated with vote_average?
There is no strong linear correlation between runtime and vote_average.
4.3. Is release_year associated with vote_average?
There is no strong linear correlation between release_year and vote_average.
4.4. Is budget_adj associated with vote_average?
There is no strong linear correlation between budget_adj and vote_average.
4.5. Is runtime associated with popularity?
There is no strong linear correlation between runtime and popularity.
4.6. Is release_year associated with popularity?
There is no strong linear correlation between release_year and popularity.
4.7. Is budget_adj associated with popularity?
There is no strong linear correlation between budget_adj and popularity.
# Let's check the average popularity in the movies of our clean dataset
aver_pop = df['popularity'].mean()
aver_pop
# Let's find the top 5 directors with highest popularity in our clean dataset
pop_dir = df.groupby('director', as_index=False)['popularity'].mean()
pop_dir = pop_dir.sort_values(by=['popularity'], ascending=False).head(5)
pop_dir
# Let's check the average vote_average in the movies of our clean dataset
aver_vote = df['vote_average'].mean()
aver_vote
# Let's find the top 5 directors with highest vote_average in our clean dataset
highvote_dir = df.groupby('director', as_index=False)['vote_average'].mean()
highvote_dir = highvote_dir.sort_values(by=['vote_average'], ascending=False).head(5)
highvote_dir
# Let's check the average revenue_adj in the movies of our clean dataset
aver_rev = df['revenue_adj'].mean()
aver_rev
# Let's find the top 5 directors with highest revenue movies in our clean dataset
highrev_dir = df.groupby('director', as_index=False)['revenue_adj'].mean()
highrev_dir = highrev_dir.sort_values(by=['revenue_adj'], ascending=False).head(5)
highrev_dir
It's important to note that we initially kept the full director group per row, to check whether any combination of directors would stand out across the three analyses. However, the only director that appears in two top 5 lists is Morten Tyldum.
We then decided to separate directors in different rows to see if there is any change in the results.
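One concise way to do that separation, assuming pandas ≥ 0.25, is `str.split` followed by `explode`, then a groupby to rank directors. A sketch on made-up rows ('A|B' stands for a two-director movie):

```python
import pandas as pd

# Synthetic movies with made-up director names and popularity scores.
toy = pd.DataFrame({
    'director': ['A|B', 'A', 'C'],
    'popularity': [4.0, 2.0, 1.0],
})

# One row per (movie, director) pair, each keeping the movie's popularity.
per_dir = toy.assign(director=toy['director'].str.split('|')).explode('director')
ranking = per_dir.groupby('director')['popularity'].mean().sort_values(ascending=False)
print(ranking)
```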
# Sort dataframe based on revenue_adj (assigning the result back, since sort_values returns a new dataframe by default)
df = df.sort_values(by=['revenue_adj'], ascending=False)
df.head(10)
# Selecting rows with just one director
one_dir = df[~df['director'].str.contains(r'\|')]
# Let's check the first 5 rows
one_dir.head(5)
# Let's check the shape of "one_dir" table
one_dir.shape
# Selecting rows with more than one director
multi_dir = df[df['director'].str.contains(r'\|')]
# Let's check the first 5 rows
multi_dir.head(5)
# Let's check how many rows and columns are in this multi_dir table
multi_dir.shape
# Let's check the maximum number of directors per movie we have in our clean dataset
n_dir = multi_dir['director'].str.count(r'\|').max() + 1
n_dir
Looks like a movie can have up to 12 directors in our clean dataset.
# Let's look where is that maximum number of directors
idmax = multi_dir['director'].str.count(r'\|').idxmax()
multi_dir.loc[idmax, 'director']
# Let's define a function that would only return one director (at a time) per movie
def single_dir(num_dir):
    dfx = multi_dir.copy(deep=True)
    if num_dir == -1:
        dfx['director'] = dfx['director'].apply(lambda x: x.split('|')[-1])
    else:
        dfx['director'] = dfx['director'].apply(
            lambda x: x.split('|')[num_dir:num_dir+1][0]
            if len(x.split('|')[num_dir:num_dir+1]) > 0 else np.nan)
    return dfx
# Let's define a function that creates a dataframe with directors separated (one director per row for every movie)
def create_df_singledir(num_dir):
    # Collect one copy of the data per director position, then concatenate
    frames = [single_dir(i * -1) for i in range(1, num_dir + 1)]
    multi_dir_sep = pd.concat(frames, sort=False)
    return multi_dir_sep.reset_index(drop=True).dropna()
# Let's create our dataframe with multiple directors separated in one row each
multi_dir_separated = create_df_singledir(n_dir)
# Let's check top 5 rows of our new table
multi_dir_separated.sort_values(by=['revenue_adj'], ascending=False).head(5)
# Let's check there is not any duplicated value in our table
multi_dir_separated.duplicated().any()
# Let's concatenate the multi_dir_separated table with the one_dir table to begin our analysis
df_dir = pd.concat([one_dir, multi_dir_separated], sort=False)
# Let's see how many rows and columns are in our final table
df_dir.shape
Now that we have our table with the director separated, let's do our three top5 analyses once again.
# Let's find the top 5 directors with highest popularity in our clean dataset
pop_dir1 = df_dir.groupby('director', as_index=False)['popularity'].mean()
pop_dir1 = pop_dir1.sort_values(by=['popularity'], ascending=False).head(5)
pop_dir1
# Let's find the top 5 directors with highest vote_average in our clean dataset.
highvote_dir1 = df_dir.groupby('director', as_index=False)['vote_average'].mean()
highvote_dir1 = highvote_dir1.sort_values(by=['vote_average'], ascending=False).head(5)
highvote_dir1
# Let's find the top 5 directors with highest revenue movies in our clean dataset.
highrev_dir1 = df_dir.groupby('director', as_index=False)['revenue_adj'].mean()
highrev_dir1 = highrev_dir1.sort_values(by=['revenue_adj'], ascending=False).head(5)
highrev_dir1
We see that the three top 5 lists vary after separating directors into one per row, but we don't see any director present in more than one top 5.
Let's show our findings with graphs.
# Let's plot the Top 5 directors based on popularity
func_bar1(pop_dir1['director'], pop_dir1['popularity'], 'Top 5 Movie Directors with Highest Average Popularity', 'Directors', 'Average TMDb Popularity Score', 0.08, 0.1, 17, aver_pop, 'General Movie Mean')
# Let's check how much more popular, on average, Colin Trevorrow's movies are compared to the general mean
CT_pop = pop_dir1[pop_dir1['director']=='Colin Trevorrow']['popularity']
diff_pop = (CT_pop - aver_pop)/aver_pop
diff_pop
In our clean dataset, director Colin Trevorrow has the highest average popularity score for his movies, as the bar chart above shows.
On average, his movies' popularity is 13 times higher than the general average.
# Let's plot the Top 5 directors based on vote_average
func_bar1(highvote_dir1['director'], highvote_dir1['vote_average'], 'Top 5 Movie Directors with Highest Average Vote', 'Directors', 'Average Vote', 0.08, 0.06, 17, aver_vote, 'General\nMovie\nMean')
# Let's check how much higher the vote_average of Damien Chazelle's movies is compared to the general mean
DC_vote = highvote_dir1[highvote_dir1['director']=='Damien Chazelle']['vote_average']
diff_vote = (DC_vote - aver_vote)/aver_vote
diff_vote
In our clean dataset, director Damien Chazelle has the highest average vote_average in his movies, as the bar chart above shows.
However, his movies' average vote is just 33% higher than the general average.
# Let's plot the Top 5 directors based on average_revenue
func_bar1(highrev_dir1['director'], highrev_dir1['revenue_adj']/1e6, 'Top 5 Movie Directors with Highest Average Revenue (in terms of 2010 dollars)', 'Directors', 'Average Revenue in terms of 2010 dollars (values in millions)', 0.1, 5, 17, aver_rev/1e6, 'General Movie Mean')
# Let's check how much higher the average revenue of Irwin Winkler's movies is compared to the general mean
IW_rev = highrev_dir1[highrev_dir1['director']=='Irwin Winkler']['revenue_adj']
diff_rev = (IW_rev - aver_rev)/aver_rev
diff_rev
In our clean dataset, director Irwin Winkler has the highest average revenue (in terms of 2010 dollars) in his movies, as the bar chart above shows.
This director has an average revenue 11 times higher than the general mean.
Before starting the analysis, we will need to process the "genres" column so that each row has a single genre, while every movie keeps all of its genres present in the table (one row per genre).
# Selecting rows with just one genre
one_genre = df[~df['genres'].str.contains(r'\|')]
# Let's check the first 5 rows
one_genre.head(5)
# Let's check how many rows and columns this one_genre table has
one_genre.shape
# Selecting rows with more than one genre
multi_genre = df[df['genres'].str.contains(r'\|')]
# Let's check the first 5 rows
multi_genre.head(5)
# Let's check how many rows and columns are in this multi_genre table
multi_genre.shape
# Let's check the maximum number of genres per movie we have in our clean dataset
n_gen = multi_genre['genres'].str.count(r'\|').max() + 1
n_gen
Looks like a movie can have up to 5 genres in our clean dataset.
# Let's look at where that maximum number of genres occurs
idmax1 = multi_genre['genres'].str.count(r'\|').idxmax()
multi_genre.loc[idmax1, 'genres']
# Let's define a function that would only return one genre (at a time) per movie
def single_genre(n_gen):
    """Return a copy of multi_genre keeping only one genre (by position) per movie."""
    dfx = multi_genre.copy(deep=True)
    if n_gen == -1:
        dfx['genres'] = dfx['genres'].apply(lambda x: x.split('|')[-1])
    else:
        # A one-element slice is empty (instead of raising) when the position is out of range
        dfx['genres'] = dfx['genres'].apply(
            lambda x: x.split('|')[n_gen:n_gen + 1][0]
            if len(x.split('|')[n_gen:n_gen + 1]) > 0 else np.NaN)
    return dfx
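The function above leans on Python's slice semantics: a one-element slice returns an empty list, rather than raising an IndexError, when the position falls outside the list, which is what the length check exploits. A quick illustration:

```python
genres = 'Adventure|Drama'.split('|')  # ['Adventure', 'Drama']

# In-range negative slice: a one-element list
print(genres[-2:-1])   # ['Adventure']

# Out-of-range negative slice: an empty list rather than an IndexError
print(genres[-5:-4])   # []
```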
# Let's define a function that creates a dataframe with genres separated (one genre per row for every movie)
def create_df_singlegen(n_gen):
    """Build a dataframe with one genre per row for every multi-genre movie."""
    frames = []
    for i in range(1, n_gen + 1):
        frames.append(single_genre(i * -1))
    # pd.concat replaces the deprecated DataFrame.append
    multi_genre_sep = pd.concat(frames, sort=False)
    return multi_genre_sep.reset_index(drop=True).dropna()
# Let's create our dataframe with multiple genres separated in one row each
multi_genre_separated = create_df_singlegen(n_gen)
# Let's check the top 5 rows of our new table
multi_genre_separated.sort_values(by=['revenue_adj'], ascending=False).head(5)
# Let's check there is not any duplicated value in our table
multi_genre_separated.duplicated().any()
# Let's append the multi_genre_separated table to one_genre table to begin our analysis
df_genre = pd.concat([one_genre, multi_genre_separated], sort=False)
# Let's see how many rows and columns are in our final table
df_genre.shape
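As a side note, pandas ≥ 0.25 offers `Series.explode`, which achieves the same one-genre-per-row expansion in a single chain — a minimal sketch on toy data (hypothetical values, not the real TMDb rows):

```python
import pandas as pd

# Toy frame mimicking the pipe-separated "genres" column
movies = pd.DataFrame({
    'id': [1, 2, 3],
    'genres': ['Drama', 'Action|Adventure', 'Comedy|Drama|Romance'],
})

# Split on the pipe, then expand the resulting lists to one genre per row
exploded = (movies.assign(genres=movies['genres'].str.split('|'))
                  .explode('genres')
                  .reset_index(drop=True))
print(exploded.shape)  # (6, 2)
```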
Now that we have our table with the genres fixed, we begin our analysis for this question.
# Let's check how many unique genres we have
df_genre['genres'].nunique()
# Let's create a table that counts how many movies are per genre in our clean dataset
count_genre = df_genre.groupby('genres', as_index=False)['id'].count()
count_genre = count_genre.sort_values(by=['id'], ascending=False).reset_index(drop=True)
count_genre.rename(columns={'id':'count'}, inplace=True)
count_genre
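The groupby/count/sort/rename chain above can also be collapsed with `value_counts`, which returns the per-genre counts already sorted in descending order (a sketch on toy data):

```python
import pandas as pd

# Toy one-genre-per-row frame (hypothetical values)
df_genre = pd.DataFrame({'genres': ['Drama', 'Drama', 'Comedy', 'Action', 'Drama']})

# value_counts counts and sorts in one step
count_genre = (df_genre['genres'].value_counts()
               .rename_axis('genres')
               .reset_index(name='count'))
print(count_genre)
```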
# Let's plot a bar chart to explain better our findings
func_bar1(count_genre['genres'], count_genre['count'], 'Distribution of Movies per Genre', 'Genres', 'Amount of Movies', 0.25, 15, 17,None,None)
In the bar chart above we can see that Drama is the predominant genre, with the highest number of movies in our clean dataset.
# Let's find the top 5 genres with highest popularity in our clean dataset.
pop_gen = df_genre.groupby('genres', as_index=False)['popularity'].mean()
pop_gen = pop_gen.sort_values(by=['popularity'], ascending=False).head(5)
pop_gen
# Let's plot our Top 5 Genres by popularity
func_bar1(pop_gen['genres'], pop_gen['popularity'], 'Top 5 Movie Genres with Highest Average Popularity', 'Genres', 'Average IMDb Popularity Score', 0.08, 0.01, 17, aver_pop, 'General Movie Mean')
# Let's check how much more popular, on average, Science Fiction and Adventure movies are compared to the general mean
SFA_pop = pop_gen[pop_gen['genres']=='Science Fiction']['popularity']
diff_pop = (SFA_pop - aver_pop)/aver_pop
diff_pop
From the bar chart above, we can see that movies from the Adventure and Science Fiction genres have the highest average Popularity score, in our clean dataset. The movies of these two genres had an average popularity 57% higher than the general average.
# Let's find the top 5 genres with highest vote_average in our clean dataset
highvote_gen = df_genre.groupby('genres', as_index=False)['vote_average'].mean()
highvote_gen = highvote_gen.sort_values(by=['vote_average'], ascending=False).head(5)
highvote_gen
# Let's plot our Top 5 Genres by vote_average
func_bar1(highvote_gen['genres'], highvote_gen['vote_average'], 'Top 5 Movie Genres with Highest Average Vote', 'Genres', 'Average Vote', 0.08, 0.02, 17, aver_vote, 'Mean')
# Let's check how much higher the average vote of Documentary movies is compared to the general mean
doc_vote = highvote_gen[highvote_gen['genres']=='Documentary']['vote_average']
diff_vote = (doc_vote - aver_vote)/aver_vote
diff_vote
From the bar chart above, we can see that movies from the Documentary genre have the highest average vote_average, in our clean dataset.
However, movies from this genre have an average vote just 7% higher than the general average.
# Let's find the top 5 genres with highest revenue in our clean dataset.
highrev_gen = df_genre.groupby('genres', as_index=False)['revenue_adj'].mean()
highrev_gen = highrev_gen.sort_values(by=['revenue_adj'], ascending=False).head(5)
highrev_gen
# Let's plot the Top 5 genres based on revenue_adj
func_bar1(highrev_gen['genres'], highrev_gen['revenue_adj']/1e6, 'Top 5 Movie Genres with Highest Average Revenue (in terms of 2010 dollars)', 'Genres', 'Average Revenue in terms of 2010 dollars (values in millions)', 0.1, 2, 17, aver_rev/1e6, 'General Movie Mean')
# Let's check how much higher the average revenue of Animation movies is compared to the general mean
AN_rev = highrev_gen[highrev_gen['genres']=='Animation']['revenue_adj']
diff_rev = (AN_rev - aver_rev)/aver_rev
diff_rev
From the bar chart above, we see that movies from the Animation genre have the highest average revenue (in terms of 2010 dollars), in our clean dataset.
Also, movies from this genre have an average revenue 112% higher than the general mean.
# Let's create a table that counts the amount of movies per genre per year
count_gen_year = df_genre.groupby(['release_year','genres'], as_index=False)['id'].count()
count_gen_year.rename(columns={'id':'count'}, inplace=True)
count_gen_year.head(5)
# Let's reshape our count_gen_year table for our plot
table1 = count_gen_year.pivot(index='release_year', columns='genres', values='count')
table1 = table1.fillna(0)
table1.head(5)
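The pivot-then-fillna pair above can also be written as a single `pivot_table` call with `fill_value=0` (a sketch on toy data in the same shape as count_gen_year):

```python
import pandas as pd

# Hypothetical counts table: no Action movies in 2015
counts = pd.DataFrame({
    'release_year': [2014, 2014, 2015],
    'genres': ['Drama', 'Action', 'Drama'],
    'count': [3, 1, 2],
})

# pivot_table fills missing year/genre combinations in one call
wide = counts.pivot_table(index='release_year', columns='genres',
                          values='count', fill_value=0)
print(wide)
```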
# Let's do our plot
table1.plot(figsize=(16, 9), title='Evolution of Movie Genre per year', cmap='tab20c')
plt.ylabel("Amount of movies per genre")
plt.xlabel("Years");
This graph is hard to read given the number of genres displayed, but we can see that the Drama genre has been somewhat predominant throughout the majority of the years.
# Let's check the amount of movies per genre in 2015
table1.tail(1)
# Let's do a plot showcasing "Drama" genre
fig, ax = plt.subplots(figsize=(16,10))
# Multiple line plot
for column in table1:
    plt.plot(table1.index, table1[column], marker='', color='grey', linewidth=1, alpha=0.4)
#Now we re do the 'Drama' curve, but bigger with distinct color
plt.plot(table1.index, table1['Drama'], marker='', color='orange', linewidth=4, alpha=0.7)
# Annotate the plot
num = 0
heights=[]
for i in table1.values[-1].tolist():
    name = list(table1)[num]
    # We will avoid genre labels overlapping by only showing the first genre in alphabetical order
    if i not in heights:
        heights.append(i)
        if name != 'Drama':
            ax.text(list(table1.index)[-1]+0.3, i, name, horizontalalignment='left', size='small', color='grey')
    num += 1
# Add a special annotation for the 'Drama' genre
ax.text(list(table1.index)[-1]+0.3, list(table1['Drama'].values)[-1], 'Drama', horizontalalignment='left', color='orange')
# Add titles and labels
plt.title("Evolution of Drama genre vs other genres")
plt.xlabel("Years")
plt.ylabel("Amount of movies per genre")
# Setting x axis limits
ax.set_xlim(list(table1.index)[0], list(table1.index)[-1]+5)
plt.show();
In the line plot above we can see that the "Drama" genre has been predominant among other movie genres through the years, in our clean dataset.
To analyze which genre is the most popular through the years, we will use the IMDb Popularity Score as it takes into account more factors than the "vote_average" column.
# Let's create a table that gets the average popularity score per genre per year
avg_pop_year = df_genre.groupby(['release_year','genres'], as_index=False)['popularity'].mean()
avg_pop_year.rename(columns={'popularity':'avg_pop'}, inplace=True)
avg_pop_year.head(5)
# Let's reshape our avg_pop_year table for our plot
table2 = avg_pop_year.pivot(index='release_year', columns='genres', values='avg_pop')
table2 = table2.fillna(0)
table2.head(5)
# Let's check the popularity score by genre in 2015
table2.tail(1)
# Let's do our plot
table2.plot(figsize=(16, 9), title='Evolution of Popularity per genre per year', cmap='tab20c')
plt.ylabel("Average Popularity Score")
plt.xlabel("Years");
In the graph above we can see that it's difficult to distinguish which genre was more popular than the others throughout the years, as the popularity per genre varies a lot from year to year.
However, let's showcase Animation, as it had the highest average popularity (among all years), and Science Fiction, as it had the highest average popularity in 2015.
# Let's do a plot showcasing "Animation" and "Science Fiction" genre
fig, ax = plt.subplots(figsize=(16,10))
# Multiple line plot
for column in table2:
    plt.plot(table2.index, table2[column], marker='', color='grey', linewidth=1, alpha=0.4)
# Re do the "Animation" curve, but bigger with distinct color
plt.plot(table2.index, table2['Animation'], marker='', color='orange', linewidth=4, alpha=0.7)
# Re do the "Science Fiction" curve, but bigger with distinct color
plt.plot(table2.index, table2['Science Fiction'], marker='', color='blue', linewidth=4, alpha=0.7)
# Annotate the plot
num = 0
heights=[]
for i in table2.values[-1].tolist():
    name = list(table2)[num]
    # Avoid genre labels overlapping by only showing the first genre of the same 2015 value (rounded), in alphabetical order
    if round(i, 0) not in heights:
        heights.append(round(i, 0))
        if name != 'Animation' and name != 'Science Fiction':
            ax.text(list(table2.index)[-1]+0.3, i, name, horizontalalignment='left', size='small', color='grey')
    num += 1
# Add a special annotation for the "Animation" genre
ax.text(list(table2.index)[-1]+0.3, list(table2['Animation'].values)[-1], 'Animation', horizontalalignment='left', color='orange')
# Add a special annotation for the "Science Fiction" genre
ax.text(list(table2.index)[-1]+0.3, list(table2['Science Fiction'].values)[-1], 'Science Fiction', horizontalalignment='left', color='blue')
# Add titles and labels
plt.title("Evolution in Popularity of Animation genre vs other genres")
plt.xlabel("Years")
plt.ylabel("Average IMDb Popularity Score")
# Setting x axis limits
ax.set_xlim(list(table2.index)[0], list(table2.index)[-1]+5)
plt.show();
In the line plot above we can see that even though Animation had the highest average popularity score among all genres in our clean dataset, there are years where its score is zero. This is probably because there were no Animation releases in those years between the 60s and 90s.
Similarly, Science Fiction also shows some years with a zero popularity score, likely because no movies of that genre were released in certain years during the 60s.
Both genres show a lot of variability through the years.
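Those zero scores come from the earlier `fillna(0)` step, which encodes "no releases that year" as a popularity of 0. One way to keep the two cases distinguishable is to skip the fill so missing years stay NaN, which matplotlib renders as a gap in the line rather than a dip to zero. A sketch on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

# Hypothetical popularity table; Animation has no 1961 rows at all
avg_pop_year = pd.DataFrame({
    'release_year': [1960, 1960, 1961],
    'genres': ['Animation', 'Drama', 'Drama'],
    'avg_pop': [0.5, 0.4, 0.6],
})

# Pivoting WITHOUT fillna(0) leaves "no releases" cells as NaN,
# so a line plot shows a gap instead of a misleading zero
wide = avg_pop_year.pivot(index='release_year', columns='genres', values='avg_pop')
print(wide.loc[1961, 'Animation'])
```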
From the univariate and multivariate analysis, we can conclude the following from our clean dataset:
Limitations of this Project:
References:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])